Status of Effort: During the past year I have worked continuously in the areas
of survival analysis, reliability theory, coverage problems, and the bootstrap method with
applications. Some of these efforts were published (96-97) or are to be published in the near
future. See the publications list below. Currently I am working with my students on the
problem of “adapting Cox models to case-control studies in survival analysis”. The method,
if developed, will have numerous applications in seeking risk factors in various fields.

Accomplishments  /  New  findings: 

Some  of  my  recent  findings,  under  the  AFOSR  grant  F49620-94-1-0035,  on  the  areas 
of  reliability  and  survival  analysis  are  particularly  encouraging.  These  results  together 
with  their  implications  are  briefly  described  as  follows: 

It is well known that the Kaplan-Meier (K-M) estimator is often very unstable
and unreliable on the tail of the survival function because of heavy censoring. The
distributional properties responsible for this behavior are largely unclear. Although
there has been some discussion along this line in the recent literature (Gill (1983), Ying (1989)
and Stute and Wang (1993)), the general problems of to what extent valid inference can
be drawn, and of what governs the convergence (or divergence) of the
tail estimator, remain virtually unanswered. The answers to these questions will
not only add new knowledge to the statistical literature; more importantly, they provide
guidelines and suggest applicable procedures for routine users, as we shall see
below.

Roughly speaking, our recent findings (some appeared in the June 1997 Ann. Stat.)
indicate that the tail behavior of the K-M curve is determined by a set of simple necessary
and sufficient conditions. There are two basic types of convergence involved here, the
STRONG law and the WEAK law (in probability). Therefore there are two conditions which
determine the K-M curve’s behavior, corresponding to the strong and weak laws respectively. If we
denote by $\tau_H$ the right limit of the support of $\bar H = \bar F \bar G$, where $\bar F = 1 - F$ and
$\bar G = 1 - G$ are the usual survival function of interest and the censoring survival function,
respectively, the K-M estimator can be defined as

$$F_n(t) = 1 - \prod_{0 < s \le t} \left(1 - \frac{dN_n(s)}{Y_n(s)}\right),$$

where $N_n(s)$ and $Y_n(s)$ are the usual counting process based on the complete (uncensored) data and the
size of the risk set at time $s$, respectively.
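The product-limit definition above can be made concrete with a short sketch. This is a minimal illustration, not the code used in the work reported here: the function name `kaplan_meier_cdf`, the encoding of the censoring indicator, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def kaplan_meier_cdf(times, events):
    """Return [(t, F_n(t))] at each distinct uncensored time (hypothetical helper).

    times  : observed times Z_i = min(T_i, C_i)
    events : 1 if the failure was observed, 0 if the observation was censored
    """
    deaths = Counter(t for t, d in zip(times, events) if d == 1)  # jumps of N_n
    leaving = Counter(times)      # how many subjects leave the risk set at each time
    at_risk = len(times)          # Y_n(s): number of subjects with Z_i >= s
    surv = 1.0                    # running product of (1 - dN_n(s) / Y_n(s))
    out = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            surv *= 1.0 - d / at_risk
            out.append((t, 1.0 - surv))
        at_risk -= leaving[t]     # subjects observed at t are no longer at risk
    return out

# Toy data (assumed for illustration): the largest observation is censored.
times = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier_cdf(times, events)
```

In this toy sample the largest observation is censored, so the estimate never reaches 1; this is the tail region whose behavior the results below address.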

We outline the major findings in the following RESULTS. The first result gives the necessary
and sufficient condition for the rate of convergence in the strong law.


RESULT  1.  (STRONG  LAW) 

For any $0 < p < \tfrac{1}{2}$ we conclude that

$$\sup_{t < \tau_H} n^p \, |F_n(t) - F(t)| = o(1) \quad \text{a.s.}$$

if and only if

$$\int \ldots \, dF < \infty. \qquad (1)$$


The second result extends Lo and Singh’s (1986) iid representation from a finite interval to
the whole real line. Note that the iid variables appearing in the following expression are
different from those in Lo and Singh (1986).


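RESULT 2. (REPRESENTATIONS OF $F_n - F$)

If (1) holds for some $p$, $0 < p < \tfrac{1}{2}$, then

$$F_n(t) - F(t) = \frac{1}{n} \sum_{i=1}^{n} \xi(Z_i, \delta_i, t) + \gamma_n(t),$$

where $\xi(Z_i, \delta_i, t) = \ldots$ and

$$\sup_{t} |\gamma_n(t)| = O\big(\ldots (\log n)^2\big) \quad \text{a.s.}$$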


The following result gives the rates of weak convergence, which are of fundamental
importance in statistics.


RESULT 3. (WEAK LAW)
Assume that $0 < p < \tfrac{1}{2}$. Then

$$\sup_{t < \tau_H} n^p \, |F_n(t) - F(t)| = O_p(1)$$


if  and  only  if 
Furthermore, we conclude that

$$\sup_{t < \tau_H} n^p \, |F_n(t) - F(t)| = o_p(1)$$

if and only if

$$\lim_{t \to \tau_H} \frac{\left(\int_t^{\tau_H} (1 - G)\, dF\right)^{1-p}}{1 - G(t)} = 0.$$
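As an illustration only, the limit condition can be checked by hand in a simple parametric case. The exponential model and the censoring rate $\theta > 0$ below are assumptions introduced for this sketch and are not part of the results above; the condition is taken to read $\lim_{t \to \tau_H} \big(\int_t^{\tau_H} (1-G)\,dF\big)^{1-p} / (1 - G(t)) = 0$, with $1 - F(t) = e^{-t}$, $1 - G(t) = e^{-\theta t}$ and $\tau_H = \infty$.

```latex
\begin{aligned}
\int_t^{\infty} (1 - G)\, dF
  &= \int_t^{\infty} e^{-\theta s}\, e^{-s}\, ds
   = \frac{e^{-(1+\theta)t}}{1+\theta}, \\[4pt]
\frac{\left(\int_t^{\infty} (1 - G)\, dF\right)^{1-p}}{1 - G(t)}
  &= (1+\theta)^{-(1-p)} \exp\{[\theta - (1-p)(1+\theta)]\, t\} \\
  &= (1+\theta)^{-(1-p)} \exp\{-[1 - p(1+\theta)]\, t\}.
\end{aligned}
```

Under these assumptions the ratio tends to zero exactly when $p < 1/(1+\theta)$. With light censoring ($\theta \le 1$) this covers every $p$ in $(0, \tfrac{1}{2})$, while heavier censoring ($\theta > 1$) restricts the admissible range to $p < 1/(1+\theta) < \tfrac{1}{2}$.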


A natural question arises: does an exact order of weak convergence exist? If it does, how can
we find it?

Results 4 and 5 below answer this question affirmatively.


RESULT  4. 

There exists a unique $p$, $0 < p < \tfrac{1}{2}$, such that

$$\ldots \, P\Big\{ \sup_{t < \tau_H} \big| n^p (F_n(t) - F(t)) \big| < m \Big\} = 1.$$


This result also tells us that, for that unique $p$, the following inequalities must hold:

$$-\infty < \liminf_{t \to \tau_H} \Big( \log(1 - H(t)) - (1 - p) \log\Big( \int_t^{\tau_H} (1 - G)\, dF \Big) \Big)$$

$$\le \limsup_{t \to \tau_H} \Big( \log(1 - H(t)) - (1 - p) \log\Big( \int_t^{\tau_H} (1 - G)\, dF \Big) \Big) < \infty.$$

Therefore, we can consider a natural estimator of $p$ as follows:


RESULT 5. (ESTIMATION OF $p$)

Consider a simple linear regression problem with observations $\{(\log j, \log(K_n(Z_{(n-j)})))\}$,
where $\{Z_{(i)}\}$ are the ordered values of $\{Z_i\}$. If we treat $\log j$ as the covariate and
$\log(K_n(Z_{(n-j)}))$ as the response variable in the linear model, then $p$ can be estimated
consistently by $\hat p$, where $1 - \hat p$ is the slope estimator from the above regression problem.
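The slope-based estimator can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `estimate_p` is hypothetical, the response values are taken as given inputs (how they are computed from the data is as defined in the result above), and the synthetic data are constructed purely to exercise the arithmetic.

```python
import math

def estimate_p(log_j, log_response):
    """Ordinary least-squares slope b of log_response on log_j; returns 1 - b.

    Hypothetical helper: both arguments are equal-length lists of floats.
    """
    n = len(log_j)
    mean_x = sum(log_j) / n
    mean_y = sum(log_response) / n
    sxx = sum((x - mean_x) ** 2 for x in log_j)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_j, log_response))
    slope = sxy / sxx            # least-squares slope estimator
    return 1.0 - slope           # p_hat, since the slope estimates 1 - p

# Synthetic check (assumed data): responses lying exactly on a line with slope 0.7.
xs = [math.log(j) for j in range(1, 21)]
ys = [0.7 * x + 0.3 for x in xs]
p_hat = estimate_p(xs, ys)
```

With real data one would presumably restrict the regression to the upper order statistics, since the relation being fitted is a tail property; how many to use is not specified here.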


Although the proposed estimator $\hat p$ is consistent, its distributional properties are
not clear at the moment. Further study is needed in this important direction.

Another question that needs to be explored is the following. From Result 4 above, there exists a
unique $p$ such that $n^p(F_n(t) - F(t))$ forms a tight sequence of processes on $D[0, \tau_H]$. The
distribution of the limiting process is not identified, however. One might use the bootstrap to
approximate the limiting distribution in practice without knowing the theoretical
distribution; the trouble is that the current case is not regular and the well-known bootstrap
theory cannot be applied.
The phenomenon discovered here, we believe, is both practically and theoretically
important. To the best of our knowledge, nothing of this kind has been reported in the statistical
literature. We also believe that the above phenomenon is not isolated. As a matter of fact,
we expect that similar phenomena will also occur in other incomplete data problems.
We strongly believe that the techniques developed here can be employed to
explore further important unknowns.


Progress  on  the  area  of 

THE  CASE-CONTROL  STUDY  WITH  FAILURE  TIME  DATA 

In most case-control studies, it is necessary to define a specific population from which the
cases and controls are randomly selected: from the individuals who developed disease in a
specified accession period, or from the individuals who are disease-free by the end of
the case accession period, respectively. Our situation is slightly different, however. The
cases are randomly selected from a time-dependent population $P_1(\tau_0)$ consisting
of individuals who meet certain inclusion criteria set before the study and are known to
have developed disease before the current calendar time $\tau = \tau_0$. For example, suppose we are
interested in a lung cancer study with risk variables $X$. The current calendar time is 1996
... The inclusion criteria may include: (1) individuals who reside in certain areas,
... at the present time $\tau_0$.
An individual who lived in the specified area, is now 48, and developed lung cancer 6 years
ago (regardless of whether dead or still alive) would be considered a sample point in $P_1(\tau_0)$. The
survival time $T$ of interest for this individual is clearly 22 years (since 42 - 20 = 22). An
individual who was diagnosed with lung cancer in 1962 at the age of 53, however, is not included in $P_1(\tau_0)$.

Similarly, the control sample is randomly selected from a time-dependent population
$P_0(\tau_0)$ consisting of those individuals who meet similar inclusion criteria and are known
to be disease-free ... For example, an individual who is 35 years old and is disease-free
contributes at least 15 years to the survival time $T$ of interest. In other words, $T$ is
... every individual’s survival time $T$ in $P_0(\tau_0)$ is censored. This is
why the individual is qualified to serve as a control.

Note that some of the cases in $P_1(\tau_0)$ may be censored (either left or right) for various
reasons. For a clear and simple illustration of our proposed method, we have chosen to
exclude the possibility that $P_1(\tau_0)$ may have incomplete observations. But the method
proposed here does extend to cover these more complicated situations.

Assume that $P_1(\tau_0)$ consists of all units with complete survival times $\{T_a,\ a \in P_1(\tau_0)\}$,
while $P_0(\tau_0)$ consists of all individuals with right-censored survival times $\{C_a,\ a \in P_0(\tau_0)\}$.
The method considered here assumes that a suitable time-dependent
population $\{P_1(\tau), P_0(\tau);\ \tau = \text{calendar time}\}$ can be defined. Some individuals
may fall in $P_0(\tau)$ initially and subsequently develop the disease and become part
of $P_1(\tau')$ for $\tau' > \tau$. For each individual with risk variables $X(t)$ (this $t$ corresponds to
survival time, which is different from $\tau$), the disease incidence rate is defined, according
to the Cox proportional hazards model, as

$$\lambda(T = t \mid X(t)) = \lambda_0(t) \exp\{X(t)\beta\}.$$

From now on we shall restrict our discussion to the case in which the risk variables are
independent of time $t$. We shall return to and comment on the general case later. It is plausible
to estimate $\beta$ but not $\lambda_0(t)$ based on the case-control data.

Suppose that $n$ cases and $m$ controls are randomly selected from $P_1(\tau_0)$ and $P_0(\tau_0)$
separately. Let $Z = 1$ if someone is included in the sample, and $Z = 0$ otherwise. (Note that $Z$
may depend on $\tau_0$ and $X$; given the disease status, i.e., membership in either $P_1(\tau_0)$ or
$P_0(\tau_0)$, however, $Z$ is independent of $X$ at $\tau = \tau_0$.)

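The full likelihood based on the observed data can be written as

$$\mathrm{Lik} = \prod_{i=1}^{n} P(T_i = t_i, X_i \mid T_i < C_i) \prod_{j=1}^{m} P(T_j > C_j, C_j = c_j, X_j \mid T_j > C_j) = L_1 \times L_2, \text{ say.}$$

$L_1$ can be further written as

$$L_1 = \prod_{i=1}^{n} P(T_i = t_i, X_i \mid Z_i = 1, T_i < C_i),$$

since given $T < C$, $T$ and $X$ are independent of $Z$.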


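We then arrive at

$$L_1 = \prod_{i=1}^{n} P(T_i = t_i \mid Z_i = 1, T_i < C_i)\, P(X_i \mid T_i = t_i, Z_i = 1, T_i < C_i)$$

$$= \prod_{i=1}^{n} P(T_i = t_i, T_i < C_i \mid X_i, Z_i = 1)\, \frac{P(X_i \mid Z_i = 1)}{P(T_i < C_i \mid Z_i = 1)}.$$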
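One way to see the last equality, sketched for a single case and under the assumption that the trailing factor is $P(X_i \mid Z_i = 1) / P(T_i < C_i \mid Z_i = 1)$, is to expand the joint probability using the definition of conditional probability:

```latex
\begin{aligned}
P(T_i = t_i, X_i \mid Z_i = 1, T_i < C_i)
  &= \frac{P(T_i = t_i,\ T_i < C_i,\ X_i \mid Z_i = 1)}{P(T_i < C_i \mid Z_i = 1)} \\[4pt]
  &= P(T_i = t_i,\ T_i < C_i \mid X_i, Z_i = 1)\,
     \frac{P(X_i \mid Z_i = 1)}{P(T_i < C_i \mid Z_i = 1)} .
\end{aligned}
```

Only the first factor depends on the survival time $t_i$; the remaining ratio does not involve $t_i$ at all.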

It is clear from the sampling plan that $P(T < C \mid Z = 1) = \frac{n}{n+m}$. Likewise, $L_2$ can be written
as

$$L_2 = \prod_{j=1}^{m} P(T_j > C_j, C_j = c_j \mid X_j, Z_j = 1)\, \frac{P(X_j \mid Z_j = 1)}{P(T_j > C_j \mid Z_j = 1)},$$

and $P(T > C \mid Z = 1) = \frac{m}{n+m}$. We attempt to maximize $L_1 \times L_2$, subject to the following
constraints:

$$\frac{n}{n+m} = \sum_{x} P(T < C \mid X = x, Z = 1)\, P(X = x \mid Z = 1),$$

$$\frac{m}{n+m} = \sum_{x} P(T > C \mid X = x, Z = 1)\, P(X = x \mid Z = 1), \qquad (2)$$


where the summation runs over all possible exposure values, and is replaced by an integral
for continuous exposures. A proof similar to that of Anderson (1972) and Prentice and Pyke
(1979) shows that this constrained maximum likelihood estimate (MLE) is the same as the
unconstrained MLE which maximizes

$$\prod_{i=1}^{n} P(T_i = t_i, T_i < C_i \mid X_i, Z_i = 1) \prod_{j=1}^{m} P(T_j > C_j, C_j = c_j \mid X_j, Z_j = 1) = L_1^* \times L_2^*, \text{ say.}$$


Now $L_1^*$ is proportional to $L_1^{**} = \prod_{i=1}^{n} P(T_i = t_i, T_i < C_i \mid X_i)$, and $L_2^*$ is proportional
to $L_2^{**} = \prod_{j=1}^{m} P(T_j > C_j, C_j = c_j \mid X_j)$. This says that if the prospective Cox model
were applied to the case-control data, as if the sampling were prospective, the likelihood
function would be proportional to the prospective likelihood. Therefore one can estimate the
parameters $\beta$ exactly as in the ordinary prospective partial likelihood method. The
major difference is that the baseline hazard function $\lambda_0(t)$ and the corresponding cumulative hazard
$\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ are no longer estimable, as evidenced by (2) and the fact that, although
$L_1 \times L_2$ is proportional to $L_1^* \times L_2^*$, as demonstrated above, the constant factor (ratio)
between the two products depends on quantities such as $P(Z = 1 \mid T = t, T < C, X)$ and
$P(Z = 1 \mid T > c, C = c, X)$, which are not estimable under the case-control design. These
two quantities involve knowledge of the sizes of $P_1(\tau_0)$ and $P_0(\tau_0)$, which is generally
not available from a case-control design unless additional sources of information are
available.
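The practical consequence, fitting $\beta$ by the ordinary partial likelihood with cases treated as observed failures and controls as right-censored observations, can be sketched as follows. This is a minimal illustration under several assumptions: a single time-independent covariate, no tied failure times, hypothetical function names, made-up toy data, and a golden-section search used in place of the Newton-Raphson iteration that standard software would normally use.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Log partial likelihood for a scalar, time-independent covariate (no ties)."""
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # controls (censored) enter only through the risk sets
        # Risk set at t_i: everyone whose observed time is at least t_i.
        risk = sum(math.exp(beta * x[k]) for k, t_k in enumerate(times) if t_k >= t_i)
        ll += beta * x[i] - math.log(risk)
    return ll

def fit_beta(times, events, x, lo=-5.0, hi=5.0, tol=1e-8):
    """Maximise the concave log partial likelihood by golden-section search on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if cox_partial_loglik(c, times, events, x) < cox_partial_loglik(d, times, events, x):
            a = c   # maximum lies to the right of c
        else:
            b = d   # maximum lies to the left of d
    return (a + b) / 2.0

# Toy data (assumed): events == 1 marks a case with observed survival time,
# events == 0 marks a control whose survival time is right-censored.
times = [2, 3, 5, 7, 8, 10]
events = [1, 1, 0, 1, 0, 0]
x = [1, 0, 1, 1, 0, 0]
beta_hat = fit_beta(times, events, x)
```

Note that nothing in this sketch attempts to recover $\lambda_0(t)$, in line with the remark that the baseline hazard is not estimable from case-control data alone.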


FURTHER WORK. The distributional properties of the estimator $\hat\beta$ derived above
are important. We propose to explore these fully in the near future. The issues of how to
extend our method to accommodate more general case-control studies which involve continuous
risk variables and time-dependent exposure are certainly interesting and important.
With minor modifications of our method, we feel we can handle the time-dependent
case without much difficulty. Covering the general case involving arbitrary continuous
covariates will, however, require smoothing techniques and special treatment, and these
deserve further study. We plan to include these problems in our future work.


Although we have discussed our method under the simplest design for case-control data,
we believe various extensions are possible: various degrees of matching or stratification can
be built into the design, and case-control sampling fractions can be allowed to vary among
matched sets or strata. The latter issue is particularly relevant to the problem raised earlier:
what if we collect another case-control data set 5 years from now? Can we combine these two
data sets, collected at distinct calendar times, to conduct a coherent analysis? The answer
to this question is “yes” under the simple design described above. Although the sampling
probabilities may vary, depending on the distinct calendar times and the corresponding
time-dependent populations $\{P_1(\tau)\}$ and $\{P_0(\tau)\}$, the likelihood function can still be derived
and shown to be proportional to the likelihood function derived from a prospective study. It
is not clear to us, however, whether this desirable property still holds in more complicated
designs which involve matching and stratification. This, together with various practical
variations of the problem, constitutes a further area of our future study. Furthermore, to
test our method, we plan to apply it to various existing data sets collected from
earlier cancer studies.